Andres Delgadillo

Project Supervised Learning : Regression

1 Cars4U Project

1.1 Objective

1.2 Data:

1.3 Problem definition and questions to be answered

In this project we want to analyze how different characteristics of used cars impact the Price of the car. The questions to be answer are:

2 Import packages and turnoff warnings

3 Import dataset and quality of data

4 Characteristics of the data

5 Processing columns

5.1 Mileage

This column is the standard mileage offered by the car company in kmpl or km/kg. We are going to split the column between values and units to see if there is a relation between Fuel_Type and Mileage

There is a clear relation between 'Fuel_Type' and 'Unit'.

Now, we can convert Mileage to numeric and drop the Unit column

5.2 Engine

'CC' string is going to be deleted

5.3 Power

'bhp' string is going to be deleted

5.4 New_Price

'Lakh' and 'Cr' strings are going to be deleted.

5.5 Featuring Engineering

Name

The Name column represents Brand, Model and Specs of the car. We are going to split this column in 3 columns to get that information

Now, we can drop 'Name' column and use 'Brand', 'Model' and 'Specs' columns

5.6 Category columns

'Brand', 'Model', 'Specs', 'Location', 'Fuel_Type', 'Transmission', and 'Owner_Type' columns are transformed to category

5.7 Drop 'S.No.' column

5.8 Duplicate rows

5.8 Check characteristics of data after processing

Data series are the correct Type.

6 Exploratory data analysis

6.1 Pandas profiling report

We can get a first statistical and descriptive analysis using pandas_profiling

6.2 Pairplot

We are going to perform univariate and bivariate analysis to understand the relationship between the columns

6.3 Univariate analysis

6.3.1 Numerical columns

6.3.1.1 Year

The Year distribution is slightly skewed to the left. The mean is 2013.36 and the median 2014, and there are not outliers.

6.3.1.2 Kilometers_Driven

The Kilometers_Driven distribution is highly skewed to the right. The mean is 58,699 km, the median 53,416 km, and there are several outliers as we can see in the chart below.

6.3.1.3 Mileage

The Mileage distribution is fairly symmetrical. The mean is 18.14 and the median 18.16. However, there are 81 rows with value equal to 0

6.3.1.4 Engine|

The Engine distribution is skewed to the right. The mean is 1616 and the median 1493

Engine has several values that are flagged as suspicious by the boxplot. However, those values are consistent with some powerful car models and we cannot considered them as outliers

6.3.1.5 Power

The Power distribution is skewed to the right. The mean is 112 and the median 94

At the same as Engine. Power has several values that are flagged as suspicious by the boxplot. However, those values are consistent with some powerful car models and we cannot considered them as outliers

6.3.1.6 Seats

6047 cars (83.4%) have 5 seats. There is one car with 0 seats and 53 with missing values.

6.3.1.7 New_Price

The New_Price distribution is skewed to the right. The mean is 22.7 and the median 11.5

There are several values flagged as suspicious by the boxplot, but they could correspond to luxury cars, and we cannot considered as outliers

6.3.1.8 Price

The Price distribution is skewed to the right. The mean is 9.4 and the median 5.6

Similar than New_Price. There are several values flagged as suspicious by the boxplot, but they could correspond to luxury cars, and we cannot considered as outliers

Categorical columns

6.3.2.1 Location

There are 11 distinct locations. Mumbai is the most frequent location, and Ahmedabad the least frequent

6.3.2.2 Transmission

There are 2 distinct Transmission values, Manual and Automatic. Manual corresponds to the 72% of the cars

6.3.2.3 Owner Type

There are 4 distinct categories for owner type. First owner corresponds to 82% of the rows

6.3.2.4 Fuel Type

There are 5 distinct Fuel Types. Diesel is the most frequent location, and there are only 2 electric cars

6.3.2.5 Brand

There are 33 distinct Brands. Maruti, Hyundai, Honda and Toyota the most common ones.

6.4 Bivariate analysis

6.4.1 Engine, Power and Price relationship

There is a strong correlation between Power and Engine. The chart is also showing that more expensive cars tend to have high values for Power and Engine

6.4.2 Power, Seats and Price relationship

There is not a clear relationship between Power and Seats. However, cars with 2 seats could have strong power and higher prices.

6.4.3 Price and Brand

This chart shows there are:

6.4.4 Price, Location and Fuel Type

7 Missing Value Treatment

First we are going to drop column New_Price since it has 6247(86.1%) rows with missing data.

There are 1234 rows with missing Prices. We are going to drop all those rows because Price is the variable we would like to predict and we don't want to create artificial information in the model

Let's check new data set

We are going to analyze if there is a pattern for the 36 rows with 3 missing values.

Now, we are going to get the columns with missing values

Now, let's calculate the percentage of missing values per column

There are no data missing and we can continue with the analysis

8 Log Transformation

8.1 Kilometers_Driven

Kilometers_Driven column is very skewed. We are going to use the log transformation to improve the distribution

We can see a very good improvement in the distribution. Now, we are going to create a new column with the log of Kilometers_Driven and drop the Kilometers_Driven column

There are several values flagged as suspicious by the boxplot for the Kilometers_Driven_log column. There are some outliers above 14, but the rest of the points aren't inconsistent with the overall distribution of the data.

8.2 Power

Power column is skewed. We are going to use the log transformation to improve the distribution

We can see an improvement in the distribution. Now, we are going to create a new column with the log of Power and drop the Power column

There are several values flagged as suspicious by the boxplot for the Power_log column. However, those points aren't inconsistent with the overall distribution of the data.

8.3 Engine

Engine column is skewed. However, the log transformation does not improve the distribution

We do not see an improvement in the distribution, and we are going to keep the original column

8.4 Price

Price column is skewed. We are going to use the log transformation to improve the distribution

We can see an improvement in the distribution. Now, we are going to create a new column with the log of Price and drop the Price column

There are several values flagged as suspicious by the boxplot for the Price_log column. However, those points aren't inconsistent with the overall distribution of the data.

9 Outliers Treatment

9.1 Kilometers_Driven

Kilometers_Driven_log have some outliers above 14. We are going to replace those values with the median

9.2 Mileage

Mileage column have several rows with value equals zero. We are going to replace those values with the median

9.3 Seats

There is one 1 car with 0 seats. We are going to replace this value with the mean

10 Model Building

First, we are going to drop column Specs because it has high cardinality (1893 distinct values)

10.1 Define independent and dependent variables

10.2 Creating dummy variables

The independent set has 6019 rows and 274 columns

10.3 Split the data into train and test

10.4 Fitting a linear model

Now, we are going to run the linear regression using the train data set

10.5 Performance of the model

First, we are going to calculate the $R^2$ for the train and test sets

The $R^2$ for the train set is 0.958 and for the test set is 0.958. Both values are comparable and very similar. Therefore, the model is not overfitting and the performance is very good

10.5.1 Performance metrics

User functions to calculate performance metrics

We can conclude that the model is not overfitting since all metrics are comparable in both train and test sets. The model is able to predict Prices with a mean error of 0.129 on the test set

10.5.2 Residuals distribution

Train set residuals

Now, we are going to analyze the distribution of the residuals

The scatter plot is random and therefore the model does not violate the assumption of Homoscedasticity

Test set residuals

The scatter plot is random and therefore the model does not violate the assumption of Homoscedasticity

10.6 Coefficients and Intercept of the model

10.6.1 Coefficients Interpretation

Positive impact

This is the list of coefficients with positive impact on prices. Among them are Year, Mileage and Power_log. Increase in these will lead to an increase in the price.

Negative impact

This is the list of coefficients with negative impact on prices. Among them are Kilometers_Drive_log, Engine and Seats. Increase in these will lead to a decrease in the price

10.6.2 Analysis of coefficients

11 Forward Feature Selection

11.1 Identify most important features

Now, we are going to plot the score vs number of features

With 50 features the score is around 0.93, and it does not improve significantly with additional features. Actually, it decreases after 265 features.

11.2 Retraining the model

New independent train and test sets with the 50 variables selected in the sequential feature selection

11.2.1 Model Performance

Observations

11.3 Coefficient Interpretation

11.3.1 Positive impact

This is the list of coefficients with positive impact on prices. Among them are Year, Power and Seats. Increase in these will lead to an increase in the price.

11.3.2 Negative impact

This is the list of coefficients with negative impact on prices. Among them are Mileage, Engine and Kilometers_Drive_log. Increase in these will lead to a decrease in the price

11.3.3 Observations

The impact of the different features on Price is similar than the original regression model

12 Actionable Insights & Recommendations

Cars4U should focus on trade:

Cars4U should avoid: